Common information:
1. Linear regression is for numeric (continuous) output values.
2. It is based on the assumption that a linear relationship exists between the input variables and the output variable.
3. The learning algorithm finds the set of parameters that minimizes the sum of squared errors, Σ(y_actual - y_estimate)².
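The least-squares criterion can be verified directly in R: the coefficients returned by lm() yield a smaller residual sum of squares than any perturbed coefficients. A minimal sketch on the built-in cars data set:

```r
# Fit a simple linear model on the built-in 'cars' data set
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares at the fitted coefficients
rss_fit <- sum(residuals(fit)^2)

# Recompute the RSS by hand with a slightly perturbed slope
b <- coef(fit)
rss_perturbed <- sum((cars$dist - (b[1] + (b[2] + 0.5) * cars$speed))^2)

rss_fit < rss_perturbed  # TRUE: no perturbation can lower the RSS
```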
Load Prestige Data:
library(car)
Prestige
summary(Prestige)
education income women prestige census type
Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80 Min. :1113 bc :44
1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 1st Qu.:3120 prof:31
Median :10.540 Median : 5930 Median :13.600 Median :43.60 Median :5135 wc :23
Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83 Mean :5402 NA's: 4
3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27 3rd Qu.:8312
Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20 Max. :9517
pairs(Prestige, main="Prestige Data")
plot.ts(Prestige, main="Prestige Data")
In the correlation matrix we can see that education and income stand in a clear relationship to prestige.
d <- data.frame(education = Prestige$education, income = Prestige$income, women = Prestige$women, prestige = Prestige$prestige, census = Prestige$census)
m <- cor(d)
library('corrplot')
corrplot(m, method="circle")
Education and income have a strong influence on prestige!
Example 1 (Prestige).
Prepare the Prestige data: build an index that marks every fourth row (1/4 of the data) as the test set; the remaining 3/4 are used for training.
testidx_pre <- which(1:nrow(Prestige)%%4==0)
Split data:
prestige_train <- Prestige[-testidx_pre,]
prestige_test <- Prestige[testidx_pre,]
Fit a linear model (target attribute: prestige):
model <- lm(prestige~., data=prestige_train)
Use the model to predict the test data:
prediciton <- predict(model, newdata=prestige_test)
Check the correlation between the predictions and the actual test values; a correlation close to 1 indicates that the predictions track the true values well:
cor(prediciton, prestige_test$prestige)
[1] 0.9376719
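The correlation only measures how well the predictions track the actual values on a relative scale; the root-mean-square error reports the typical prediction error in the units of the target. A minimal sketch, assuming the prediciton and prestige_test objects from the split above (NA predictions caused by missing type values are dropped):

```r
# RMSE of the predictions on the test set, in prestige units
err <- prediciton - prestige_test$prestige
sqrt(mean(err^2, na.rm = TRUE))
```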
Show result:
summary(model)
Call:
lm(formula = prestige ~ ., data = prestige_train)
Residuals:
Min 1Q Median 3Q Max
-13.9079 -5.0336 0.3159 5.3831 17.8852
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.071e+01 1.142e+01 -1.813 0.07437 .
education 4.201e+00 8.291e-01 5.067 3.49e-06 ***
income 1.150e-03 3.511e-04 3.277 0.00168 **
women 3.630e-02 4.006e-02 0.906 0.36817
census 1.865e-03 9.913e-04 1.881 0.06442 .
typeprof 1.131e+01 7.393e+00 1.530 0.13075
typewc 1.987e+00 4.958e+00 0.401 0.68984
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.416 on 66 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.8204, Adjusted R-squared: 0.8041
F-statistic: 50.26 on 6 and 66 DF, p-value: < 2.2e-16
curve(-2.071e+01 + 4.201e+00*x, from=6, to=16, xlab="education", ylab="predicted prestige") # prestige vs. education, other predictors held at 0
plot(model) # diagnostic plots: residuals, Q-Q, scale-location, leverage
prediciton
purchasing.officers architects draughtsmen social.workers ministers physicians
51.30093 75.85178 54.71590 63.92539 61.74322 92.95148
nursing.aides commercial.artists typists shipping.clerks postal.clerks travel.clerks
31.61065 51.26625 44.32374 33.43635 37.56363 45.73771
sales.clerks real.estate.salesmen cooks launderers farm.workers slaughterers.2
38.52429 46.44319 28.69344 27.54240 31.72468 33.23791
tool.die.makers auto.workers sewing.mach.operators electrical.linemen masons pilots
46.48374 37.75900 28.62746 43.20600 30.26737 75.30271
longshoremen
37.28700
Example 2 (Census).
Prepare the Prestige data (census target): build an index that marks every fourth row (1/4 of the data) as the test set; the remaining 3/4 are used for training.
testidx_pre <- which(1:nrow(Prestige)%%4==0)
Split data:
prestige_train <- Prestige[-testidx_pre,]
prestige_test <- Prestige[testidx_pre,]
Fit a linear model (target attribute: census). Note that census is an occupational classification code, so treating it as a continuous numeric target is questionable; the example is purely illustrative:
model <- lm(census~., data=prestige_train)
Use the model to predict the test data:
prediciton <- predict(model, newdata=prestige_test)
Check the correlation between the predictions and the actual test values; a correlation close to 1 indicates that the predictions track the true values well:
cor(prediciton, prestige_test$census)
[1] 0.7521363
Show result:
summary(model)
Call:
lm(formula = census ~ ., data = prestige_train)
Residuals:
Min 1Q Median 3Q Max
-2314.6 -280.6 137.6 555.8 1660.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.120e+03 8.624e+02 10.576 7.61e-16 ***
education -2.249e+02 1.149e+02 -1.957 0.0546 .
income -2.631e-02 4.568e-02 -0.576 0.5666
women -1.740e-01 4.876e+00 -0.036 0.9716
prestige 2.728e+01 1.451e+01 1.881 0.0644 .
typeprof -5.135e+03 6.547e+02 -7.844 5.04e-11 ***
typewc -3.377e+03 4.334e+02 -7.791 6.26e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 897.1 on 66 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.8996, Adjusted R-squared: 0.8905
F-statistic: 98.6 on 6 and 66 DF, p-value: < 2.2e-16
curve(9.120e+03 - 2.249e+02*x, from=6, to=16, xlab="education", ylab="predicted census") # census code vs. education, other predictors held at 0
plot(model) # diagnostic plots: residuals, Q-Q, scale-location, leverage
prediciton
purchasing.officers architects draughtsmen social.workers ministers physicians
2731.290 2270.106 2668.436 2116.075 2585.987 2106.844
nursing.aides commercial.artists typists shipping.clerks postal.clerks travel.clerks
7842.070 2884.537 4203.111 4397.013 4386.272 3975.482
sales.clerks real.estate.salesmen cooks launderers farm.workers slaughterers.2
4126.262 4346.245 8098.660 7948.086 7724.172 8213.242
tool.die.makers auto.workers sewing.mach.operators electrical.linemen masons pilots
7798.560 8048.416 8363.928 7981.632 8466.525 2659.409
longshoremen
7824.743
Example 3 (Prestige).
Prepare the Prestige data: build an index that marks every tenth row (1/10 of the data) as the test set; the remaining 9/10 are used for training.
testidx_pre <- which(1:nrow(Prestige)%%10==0)
Split data:
prestige_train <- Prestige[-testidx_pre,]
prestige_test <- Prestige[testidx_pre,]
Fit a linear model (target attribute: prestige):
model <- lm(prestige~., data=prestige_train)
Use the model to predict the test data:
prediciton <- predict(model, newdata=prestige_test)
Check the correlation between the predictions and the actual test values; a correlation close to 1 indicates that the predictions track the true values well:
cor(prediciton, prestige_test$prestige)
[1] 0.9274643
Show result:
summary(model)
Call:
lm(formula = prestige ~ ., data = prestige_train)
Residuals:
Min 1Q Median 3Q Max
-13.060 -5.405 1.202 5.051 19.119
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.295e+01 8.455e+00 -1.532 0.12948
education 3.957e+00 6.907e-01 5.729 1.67e-07 ***
income 1.012e-03 2.775e-04 3.645 0.00047 ***
women 1.450e-02 3.244e-02 0.447 0.65612
census 1.192e-03 6.420e-04 1.857 0.06695 .
typeprof 1.076e+01 4.815e+00 2.235 0.02819 *
typewc 6.187e-01 3.355e+00 0.184 0.85418
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.107 on 81 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.8394, Adjusted R-squared: 0.8275
F-statistic: 70.56 on 6 and 81 DF, p-value: < 2.2e-16
curve(-1.295e+01 + 3.957e+00*x, from=6, to=16, xlab="education", ylab="predicted prestige") # uses the coefficients of this model, not those of Example 1
plot(model) # diagnostic plots: residuals, Q-Q, scale-location, leverage
prediciton
mining.engineers ministers pharmacists shipping.clerks sales.supervisors cooks bakers
69.47141 62.97919 72.66327 33.88479 40.53581 28.87879 31.40611
auto.workers construction.foremen longshoremen
36.63099 39.11454 36.07945
Example 4 (Census).
Prepare the Prestige data (census target): build an index that marks every tenth row (1/10 of the data) as the test set; the remaining 9/10 are used for training.
testidx_pre <- which(1:nrow(Prestige)%%10==0)
Split data:
prestige_train <- Prestige[-testidx_pre,]
prestige_test <- Prestige[testidx_pre,]
Fit a linear model (target attribute: census):
model <- lm(census~., data=prestige_train)
Use the model to predict the test data:
prediciton <- predict(model, newdata=prestige_test)
Check the correlation between the predictions and the actual test values; a correlation close to 1 indicates that the predictions track the true values well:
cor(prediciton, prestige_test$census)
[1] 0.9451727
Show result:
summary(model)
Call:
lm(formula = census ~ ., data = prestige_train)
Residuals:
Min 1Q Median 3Q Max
-4040.9 -384.0 141.7 602.9 5759.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.975e+03 9.407e+02 10.604 < 2e-16 ***
education -3.830e+02 1.321e+02 -2.899 0.00481 **
income 7.575e-03 5.075e-02 0.149 0.88171
women -5.273e+00 5.474e+00 -0.963 0.33824
prestige 3.425e+01 1.844e+01 1.857 0.06695 .
typeprof -4.292e+03 6.927e+02 -6.195 2.29e-08 ***
typewc -2.592e+03 4.905e+02 -5.285 1.04e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1205 on 81 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.8129, Adjusted R-squared: 0.7991
F-statistic: 58.67 on 6 and 81 DF, p-value: < 2.2e-16
curve(9.975e+03 - 3.830e+02*x, from=6, to=16, xlab="education", ylab="predicted census") # census code vs. education, other predictors held at 0
plot(model) # diagnostic plots: residuals, Q-Q, scale-location, leverage
prediciton
mining.engineers ministers pharmacists shipping.clerks sales.supervisors cooks bakers
2511.492 2637.236 2180.507 4904.864 5002.044 7777.201 8275.712
auto.workers construction.foremen longshoremen
7948.091 8633.108 7699.212
Predictors whose coefficients are not marked with at least one * are not statistically significant (at the 5% level) and can usually be dropped from the model without much loss of fit.
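Dropping the predictors without a star and refitting shows how much (or how little) explanatory power is lost. A sketch, assuming model is the census fit from Example 4, where only education and type were significant:

```r
# Refit with the significant predictors only and compare the adjusted R^2
reduced <- update(model, . ~ education + type)
summary(reduced)$adj.r.squared  # compare against summary(model)$adj.r.squared
```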
Because the goal is to minimize the squared error, linear regression is very sensitive to outliers whose output values deviate greatly from the rest of the data.
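A self-contained sketch of this sensitivity: corrupting a single output value noticeably shifts the fitted slope.

```r
set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 1)        # clean linear data, true slope = 2

slope_clean <- coef(lm(y ~ x))["x"]

y[20] <- y[20] + 100                  # one extreme outlier in the output
slope_outlier <- coef(lm(y ~ x))["x"]

c(slope_clean, slope_outlier)         # the slope is pulled towards the outlier
```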